Group CC901E2
The University of Sydney
The aim of the model is to predict house prices in New York.
The data set was originally sourced from the mosaicData package in R.
The data set contains information about New York houses, including house prices (in US dollars), lot size (acres), the number of bedrooms and bathrooms, and the type of heating system.
Rows: 1,734
Columns: 16
$ Price <int> 132500, 181115, 109000, 155000, 86060, 120000, 153000, 1…
$ Lot.Size <dbl> 0.09, 0.92, 0.19, 0.41, 0.11, 0.68, 0.40, 1.21, 0.83, 1.…
$ Waterfront <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Age <int> 42, 0, 133, 13, 0, 31, 33, 23, 36, 4, 123, 1, 13, 153, 9…
$ Land.Value <int> 50000, 22300, 7300, 18700, 15000, 14000, 23300, 14600, 2…
$ New.Construct <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Central.Air <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ Fuel.Type <fct> Electric, Gas, Gas, Gas, Gas, Gas, Oil, Oil, Electric, G…
$ Heat.Type <fct> Electric, Hot Water, Hot Water, Hot Air, Hot Air, Hot Ai…
$ Sewer.Type <fct> Private, Private, Public, Private, Public, Private, Priv…
$ Living.Area <int> 906, 1953, 1944, 1944, 840, 1152, 2752, 1662, 1632, 1416…
$ Pct.College <int> 35, 51, 51, 51, 51, 22, 51, 35, 51, 44, 51, 51, 41, 57, …
$ Bedrooms <int> 2, 3, 4, 3, 2, 4, 4, 4, 3, 3, 7, 3, 2, 3, 3, 3, 3, 4, 2,…
$ Fireplaces <int> 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ Bathrooms <dbl> 1.0, 2.5, 1.0, 1.5, 1.0, 1.0, 1.5, 1.5, 1.5, 1.5, 1.0, 2…
$ Rooms <int> 5, 6, 8, 5, 3, 8, 8, 9, 8, 6, 12, 6, 4, 5, 8, 4, 7, 12, …
This data set does not contain any missing data; however, outliers do exist. For example, a 0-acre lot size cannot exist. Also, it is unlikely that US$5,000 would be enough to purchase a house in New York in 2006.
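A minimal sketch of how such implausible records could be flagged. The toy data frame `house_toy`, its rows, and both cut-offs are hypothetical and only mirror the two examples mentioned above; the real data frame (called `house` in the later `lm()` call) would be used instead.

```r
# Toy stand-in for the real data frame (hypothetical rows, for illustration only)
house_toy <- data.frame(
  Price    = c(132500, 181115, 5000, 109000),
  Lot.Size = c(0.09, 0.92, 0.25, 0.00)
)

# Assumed cut-offs: a lot must have positive area, and a sale price at or
# below USD 5,000 is treated as suspicious
suspicious <- with(house_toy, Lot.Size <= 0 | Price <= 5000)

# Row indices of the records that deserve a closer look
flagged_rows <- which(suspicious)
print(house_toy[flagged_rows, ])
```

Whether flagged rows should be removed, corrected, or kept is a separate decision that the sketch does not make.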
| Variable | Mean | SD | Max | Min |
|---|---|---|---|---|
| price | 211545.05 | 98553.81 | 775000.00 | 5000.00 |
| lot size | 0.50 | 0.70 | 12.20 | 0.00 |
| age | 28.26 | 29.86 | 225.00 | 0.00 |
| land value | 34536.23 | 34980.94 | 412600.00 | 200.00 |
| pct college | 55.57 | 10.32 | 82.00 | 20.00 |
| living area | 1752.63 | 620.22 | 5228.00 | 616.00 |
| bedrooms | 3.15 | 0.82 | 7.00 | 1.00 |
| rooms | 7.03 | 2.32 | 12.00 | 2.00 |
Since our dependent variable is ‘Price’, the graph shows that ‘Living.Area’ is the variable most strongly correlated with ‘Price’ (0.71, shown in dark pink), so it is likely to be the most useful single predictor.
Dependent variable: Price
Independent variable: Living.Area
Call:
lm(formula = Price ~ Living.Area, data = house)
Coefficients:
(Intercept) Living.Area
12844.2 113.4
The errors \(\varepsilon_i\) are iid \(\mathcal{N}(0,\sigma^2)\) and there is a linear relationship between y and x.
Linearity: In the residuals vs fitted values plot, the auxiliary line is reasonably close to a straight line (with no obvious curve), so there is no obvious pattern and the linearity assumption is met.
Homoskedasticity: The residuals appear evenly spread and do not appear to be fanning out or changing their variability over the range of the fitted values, so the constant error variance assumption is met.
Normality: In the QQ plot, the points are reasonably close to the diagonal line. Although some outliers exist, the data set is relatively large (1,734 observations), so we can also rely on the central limit theorem. Thus, the normality assumption is approximately satisfied.
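The two diagnostic plots referred to above can be produced with base R's `plot()` method for `lm` objects. The sketch below does not use the real data: it simulates a data set from the fitted simple model so that the code is self-contained. The uniform distribution for `Living.Area` and the error standard deviation (approximated from the reported SD of Price and the \(r^2\) of the simple model) are assumptions made for illustration only.

```r
set.seed(1)  # make the simulation reproducible

n     <- 1734                              # rows reported for the data set
x     <- runif(n, min = 616, max = 5228)   # assumed uniform over the observed range
sigma <- 98553.81 * sqrt(1 - 0.509)        # rough error SD implied by SD(Price) and r^2
y     <- 12844.179 + 113.373 * x + rnorm(n, sd = sigma)
sim   <- data.frame(Price = y, Living.Area = x)

fit_sim <- lm(Price ~ Living.Area, data = sim)

# Left: residuals vs fitted (linearity, constant variance)
# Right: normal QQ plot of the residuals (normality)
par(mfrow = c(1, 2))
plot(fit_sim, which = 1:2)
```

Because the simulated errors really are iid normal, these plots show roughly what "assumptions met" looks like; with the real data, `fit_sim` would be replaced by the model fitted to `house`.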
Conclusion:
As the assumptions are all met, we conclude that our estimated simple model is \(\widehat{Price} = 12844.179 + 113.373 \times Living.Area\).
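As a quick sanity check of the estimated model, a least-squares line always passes through the point of means, so plugging in the mean living area from the summary table should return a value very close to the mean price (211545.05). The helper name `predict_price` is hypothetical.

```r
# Coefficients of the simple model, copied from the fitted equation above
b0 <- 12844.179   # intercept (USD)
b1 <- 113.373     # change in predicted price per additional unit of living area

# Hypothetical helper: predicted price for a given living area
predict_price <- function(living_area) b0 + b1 * living_area

# Prediction at the mean living area reported in the summary table
pred_at_mean <- predict_price(1752.63)
print(round(pred_at_mean, 2))
```

Any small gap between this prediction and the reported mean price comes from rounding in the published coefficients and means.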
| Term (backward model) | Estimate |
|---|---|
| (Intercept) | 7740.537 |
| Lot.Size | 7372.038 |
| Waterfront1 | 120327.799 |
| Age | -140.797 |
| Land.Value | 0.920 |
| New.Construct1 | -44544.524 |
| Central.Air1 | 9639.198 |
| Heat.TypeHot Air | 9998.556 |
| Heat.TypeHot Water | -511.478 |
| Heat.TypeNone | -32952.329 |
| Living.Area | 70.172 |
| Bedrooms | -7797.557 |
| Bathrooms | 23048.177 |
| Rooms | 3045.908 |
| Term (forward model) | Estimate |
|---|---|
| (Intercept) | 7740.537 |
| Living.Area | 70.172 |
| Land.Value | 0.920 |
| Bathrooms | 23048.177 |
| Waterfront1 | 120327.799 |
| New.Construct1 | -44544.524 |
| Heat.TypeHot Air | 9998.556 |
| Heat.TypeHot Water | -511.478 |
| Heat.TypeNone | -32952.329 |
| Lot.Size | 7372.038 |
| Central.Air1 | 9639.198 |
| Age | -140.797 |
| Rooms | 3045.908 |
| Bedrooms | -7797.557 |
Price ~ Lot.Size + Waterfront1 + Age + Land.Value + New.Construct1 + Central.Air1 + Fuel.TypeGas + Fuel.TypeOil + Heat.TypeHot Water + Heat.TypeNone + Living.Area + Bedrooms + Bathrooms + Rooms
Land.Value, Living.Area, Bathrooms, New.Construct and Waterfront are the five most important variables for predicting Price.
The HeatTypeHot.Air line indicates that a group of variables contains similar information to it.
Sewer.TypePublic and all levels of FuelType lie below the path of the redundant variable, which means they are included in models only by chance (they do not provide any useful information).
| name | prob | logLikelihood |
|---|---|---|
| Price~1 | 1.00 | -22398.09 |
| Price~Living.Area | 1.00 | -21781.28 |
| Price~Land.Value+Living.Area | 1.00 | -21595.32 |
| Price~Land.Value+Living.Area+Bathrooms | 0.56 | -21560.23 |
| Price~Waterfront1+Land.Value+Living.Area+Bathrooms | 0.78 | -21529.76 |
| Price~Waterfront1+Land.Value+New.Construct1+Living.Area+Bathrooms | 0.57 | -21513.72 |
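The log-likelihoods listed above always improve as variables are added, so on their own they cannot say when to stop. One common way to trade fit against model size is an information criterion. The sketch below computes AIC and BIC from the reported log-likelihoods. The parameter count `k` (coefficients plus one error variance) and the sample size `n = 1734` are assumptions, and this is not necessarily the criterion the original analysis used.

```r
# Models and log-likelihoods copied from the listing above
models <- c("Price~1",
            "Price~Living.Area",
            "Price~Land.Value+Living.Area",
            "Price~Land.Value+Living.Area+Bathrooms",
            "Price~Waterfront1+Land.Value+Living.Area+Bathrooms",
            "Price~Waterfront1+Land.Value+New.Construct1+Living.Area+Bathrooms")
loglik <- c(-22398.09, -21781.28, -21595.32, -21560.23, -21529.76, -21513.72)

# Assumed counting rule: intercept + slopes + error variance
k <- seq_along(loglik) + 1
n <- 1734  # rows reported for the data set

aic <- -2 * loglik + 2 * k        # Akaike information criterion
bic <- -2 * loglik + log(n) * k   # Bayesian information criterion

print(data.frame(model = models, k = k, AIC = aic, BIC = bic))
```

Lower values are better under both criteria. The selection probabilities in the `prob` column answer a different question (how often a model is chosen across resamples), so the two views need not agree.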
Which model is better?
\[\widehat{Price} = 12844.179 + 113.373 \times Living.Area\]
| Term (backward model) | Estimate |
|---|---|
| (Intercept) | 7740.537 |
| Lot.Size | 7372.038 |
| Waterfront1 | 120327.799 |
| Age | -140.797 |
| Land.Value | 0.920 |
| New.Construct1 | -44544.524 |
| Central.Air1 | 9639.198 |
| Heat.TypeHot Air | 9998.556 |
| Heat.TypeHot Water | -511.478 |
| Heat.TypeNone | -32952.329 |
| Living.Area | 70.172 |
| Bedrooms | -7797.557 |
| Bathrooms | 23048.177 |
| Rooms | 3045.908 |
| Term (forward model) | Estimate |
|---|---|
| (Intercept) | 7740.537 |
| Living.Area | 70.172 |
| Land.Value | 0.920 |
| Bathrooms | 23048.177 |
| Waterfront1 | 120327.799 |
| New.Construct1 | -44544.524 |
| Heat.TypeHot Air | 9998.556 |
| Heat.TypeHot Water | -511.478 |
| Heat.TypeNone | -32952.329 |
| Lot.Size | 7372.038 |
| Central.Air1 | 9639.198 |
| Age | -140.797 |
| Rooms | 3045.908 |
| Bedrooms | -7797.557 |
| Term (stable model) | Estimate |
|---|---|
| (Intercept) | 3136.912 |
| Waterfront1 | 123131.166 |
| Land.Value | 0.912 |
| Living.Area | 71.031 |
| Bathrooms | 27058.455 |
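To make the stable model's coefficients concrete, the sketch below evaluates it for a hypothetical house. The land value and living area are the sample means from the summary table, the bathroom count of 2 is an arbitrary choice, and the function name `predict_stable` is hypothetical.

```r
# Coefficients of the stable model, copied from the table above
stable_coef <- c("(Intercept)" = 3136.912,
                 Waterfront1   = 123131.166,
                 Land.Value    = 0.912,
                 Living.Area   = 71.031,
                 Bathrooms     = 27058.455)

# Hypothetical helper: linear predictor of the stable model
predict_stable <- function(waterfront, land_value, living_area, bathrooms) {
  # Design vector in the same order as the coefficients
  x <- c(1, waterfront, land_value, living_area, bathrooms)
  sum(stable_coef * x)
}

# Same hypothetical house, without and with a waterfront location
inland     <- predict_stable(0, 34536.23, 1752.63, 2)
waterfront <- predict_stable(1, 34536.23, 1752.63, 2)
print(c(inland = inland, waterfront = waterfront,
        difference = waterfront - inland))
```

The difference between the two predictions equals the Waterfront1 coefficient, which is how a coefficient on a 0/1 indicator is read: the estimated price premium with all other predictors held fixed.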
| Model | \(r^2\) | Adjusted \(r^2\) |
|---|---|---|
| Simple | 0.509 | 0.509 |
| Backward | 0.655 | 0.652 |
| Forward | 0.655 | 0.652 |
| Stable | 0.633 | 0.632 |
As shown, the stable model has relatively higher error rates and slightly lower \(r^2\) and adjusted \(r^2\) values than the backward and forward models. It is a compromise between accuracy and stability.
If more information about the data set were provided, a domain expert could make a better judgement about which model to choose.
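The adjusted \(r^2\) values in the comparison table can be approximately reproduced from the reported \(r^2\) values using \(1 - (1 - r^2)\frac{n-1}{n-p-1}\). In the sketch below, `n = 1734` is taken from the row count of the data set and `p` is the number of slope coefficients counted from each coefficient table; both are assumptions about how the original values were computed.

```r
n <- 1734  # rows reported for the data set

# r^2 values copied from the comparison table; p = number of slope coefficients
fits <- data.frame(
  model = c("Simple", "Backward", "Forward", "Stable"),
  r2    = c(0.509, 0.655, 0.655, 0.633),
  p     = c(1, 13, 13, 4)
)

# Adjusted r^2 penalises r^2 for the number of predictors in the model
fits$adj_r2 <- 1 - (1 - fits$r2) * (n - 1) / (n - fits$p - 1)
print(transform(fits, adj_r2 = round(adj_r2, 3)))
```

With a sample this large the penalty is small, which is why the adjusted values differ from \(r^2\) by at most 0.003 even for the 13-predictor models.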